White Wine Quality is a tidy data set containing 4,898 white wines with 11 variables that quantify the chemical properties of each wine. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (very excellent).
library(ggplot2)
library(dplyr)
library(gridExtra)
library(knitr)
Whitewine <- read.csv('wineQualityWhites.csv', header = TRUE, row.names = 1)
names(Whitewine)
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
All the variables are numeric; the data set contains no factor columns.
summary(Whitewine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Quality values range from 3 to 9. The median (6) and the mean (5.878) are very close to each other, which suggests the distribution is not strongly skewed.
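Comparing the mean and median is only an informal check. As a minimal sketch, one common moment-based skewness coefficient can be computed directly; `sample_skewness()` is a hypothetical helper, and the two vectors are made-up toy data, not values from the wine data set.

```r
# Moment-based skewness: third central moment divided by
# the second central moment raised to the power 1.5.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

symmetric_toy <- c(1, 2, 3, 4, 5)    # toy data, symmetric around 3
skewed_toy    <- c(1, 1, 1, 2, 10)   # toy data with a long right tail

print(sample_skewness(symmetric_toy))  # zero for a symmetric sample
print(sample_skewness(skewed_toy))     # positive for a right-skewed sample
```

With the wine data loaded, the same helper could be applied as `sample_skewness(Whitewine$quality)`; a value near zero would support the reading above.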
ggplot(aes(x = quality), data = Whitewine) + geom_bar() +
scale_y_continuous(breaks = seq(0,2250,250)) +
scale_x_continuous(limits = c(3,10), breaks = seq(3,9,1))
The plot shows that the most common quality score is 6. Very few wines have a quality score of 9.
Let’s explore the other variables of the dataset and plot their distributions.
grid.arrange(ggplot(aes(x = fixed.acidity), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = fixed.acidity ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Outliers
ggplot(aes(x = fixed.acidity), data = Whitewine) +
geom_histogram(binwidth = 0.1) +
scale_x_continuous(breaks = seq(0,15,1))
The distribution of fixed acidity is very close to normal, but there are some outliers in the data.
summary(Whitewine$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The summary table and the box plot show that the maximum (14.2) lies far above the 3rd quartile (7.3), so the upper tail contains a few extreme values.
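The box plot's notion of an outlier can be made concrete. As a minimal sketch, assuming the default `geom_boxplot()` rule that draws points beyond 1.5 times the interquartile range from the box as outliers, the fences for fixed.acidity follow from the quartiles in the summary above. The names `q1`, `q3`, `iqr`, and `fences` are illustrative and not part of the original analysis.

```r
# Quartiles of fixed.acidity, copied from the summary() output above
q1 <- 6.3
q3 <- 7.3

# Interquartile range and the 1.5 * IQR fences of a standard box plot
iqr <- q3 - q1
fences <- c(lower = q1 - 1.5 * iqr, upper = q3 + 1.5 * iqr)
print(fences)
```

Under this rule, values below about 4.8 or above about 8.8 are drawn as individual points, so both the minimum (3.8) and the maximum (14.2) fall outside the fences.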
After trimming the bottom and top 1 percent, the graph below looks close to normal.
ggplot(aes(x = fixed.acidity), data = Whitewine) +
geom_histogram(binwidth = 0.1) +
scale_x_continuous(breaks = seq(0,15,1),
limits = c(quantile(Whitewine$fixed.acidity, 0.01) ,
quantile(Whitewine$fixed.acidity, 0.99)))
## Warning: Removed 75 rows containing non-finite values (stat_bin).
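The warning above appears because limits set in `scale_x_continuous()` turn out-of-range values into missing values before the histogram is computed. An alternative, shown here as a sketch, is to filter the data explicitly so the trimming step is visible in the code. `trim_quantiles()` is a hypothetical helper and `toy` is made-up data; neither comes from the original analysis.

```r
# Keep only the values that lie between two quantiles of x.
trim_quantiles <- function(x, lower = 0, upper = 0.99) {
  bounds <- quantile(x, probs = c(lower, upper), na.rm = TRUE, names = FALSE)
  x[!is.na(x) & x >= bounds[1] & x <= bounds[2]]
}

toy <- c(1:99, 1000)            # toy data with one extreme value
trimmed <- trim_quantiles(toy)  # drops everything above the 99th percentile

print(length(toy) - length(trimmed))  # number of values removed
print(max(trimmed))                   # largest value that survives
```

With the wine data, the same helper could be called as `trim_quantiles(Whitewine$fixed.acidity, 0.01, 0.99)` before plotting.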
grid.arrange(ggplot(aes(x = volatile.acidity), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = volatile.acidity ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x = volatile.acidity), data = Whitewine) +
geom_histogram(binwidth = 0.01) +
scale_x_continuous(breaks = seq(0,1.1,0.1))
summary(Whitewine$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The data contain some extreme points that skew the distribution. Let's trim the top 1 percent to obtain the graph below:
ggplot(aes(x = volatile.acidity), data = Whitewine) +
geom_histogram(binwidth = 0.01) +
scale_x_continuous(breaks = seq(0.1,1.1,0.1),
limits = c(0,quantile(Whitewine$volatile.acidity, 0.99)))
## Warning: Removed 48 rows containing non-finite values (stat_bin).
grid.arrange(ggplot(aes(x = citric.acid), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = citric.acid ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x = citric.acid), data = Whitewine) +
geom_histogram(binwidth = 0.01) +
scale_x_continuous(breaks = seq(0.1,1.5,0.1))
The citric acid feature has many high and low outlier values, so trimming them should give a cleaner distribution. The spike just below 0.5 deserves extra attention.
summary(Whitewine$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid also has extremely high values. Omitting the top 1 percent gives a distribution that is close to normal:
ggplot(aes(x = citric.acid), data = Whitewine) +
geom_histogram(binwidth = 0.01) +
scale_x_continuous(breaks = seq(0.1,1.5,0.1),
limits = c(0,quantile(Whitewine$citric.acid, 0.99)))
## Warning: Removed 22 rows containing non-finite values (stat_bin).
table(Whitewine$citric.acid)
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
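The table makes the unusual spike easy to locate. As a rough sketch, the hypothetical helper `flag_spikes()` below marks any value whose count exceeds three times the larger of its neighbouring counts; the threshold of 3 is an arbitrary assumption. The counts are copied from the 0.45 to 0.53 portion of the table above.

```r
# Counts for citric.acid values 0.45 to 0.53, copied from the table above
counts <- c(`0.45` = 46, `0.46` = 51, `0.47` = 38, `0.48` = 39, `0.49` = 215,
            `0.5` = 35, `0.51` = 25, `0.52` = 23, `0.53` = 16)

# Flag entries whose count is more than `ratio` times the larger neighbour.
flag_spikes <- function(counts, ratio = 3) {
  n <- length(counts)
  is_spike <- logical(n)
  for (i in seq_len(n)) {
    # Indices of the existing neighbours (the ends have only one)
    neighbours <- setdiff(c(i - 1, i + 1), c(0, n + 1))
    is_spike[i] <- counts[i] > ratio * max(counts[neighbours])
  }
  names(counts)[is_spike]
}

print(flag_spikes(counts))
```

On these counts the only flagged value is 0.49, which has 215 wines against 39 and 35 for its neighbours.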
grid.arrange(ggplot(aes(x = residual.sugar), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = residual.sugar ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The box plot shows a few very high outlier points, so we trim them to get a clearer view.
ggplot(aes(x = residual.sugar), data = Whitewine) +
geom_histogram(binwidth = 0.1)
summary(Whitewine$residual.sugar)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The residual.sugar distribution is highly right-skewed, with a few extremely high values in the upper tail.
After trimming, the graph below is obtained:
ggplot(aes(x = residual.sugar), data = Whitewine) +
geom_histogram(binwidth = 1, fill = '#5760AB') +
scale_x_continuous( limits = c(0.6,
quantile(Whitewine$residual.sugar, 0.99)),
breaks = seq(0,50,1))
## Warning: Removed 47 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
After trimming, the distribution looks multimodal, so wines with several distinct residual sugar levels exist. One group has very little residual sugar, one is sweeter (around 5), and another is sweeter still (around 8).
grid.arrange(ggplot(aes(x = chlorides), data = Whitewine) +
geom_histogram(color = 'Black', fill = '#F79420'),
ggplot(aes(x = 1, y = chlorides ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The box plot shows a large number of outliers, which indicates that the distribution is highly skewed. The default bins make the histogram hard to read, so we narrow them for a better view.
summary(Whitewine$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
ggplot(aes(x = chlorides), data = Whitewine) +
geom_histogram(binwidth = 0.001, fill = '#5760AB')
The distribution looks good, but the data are widely spread. We omit the top 1% of the data for a clearer view.
ggplot(aes(x = chlorides), data = Whitewine) +
geom_histogram(binwidth = 0.001, fill = '#5760AB') +
scale_x_continuous( limits = c(0, quantile(Whitewine$chlorides, 0.99)))
## Warning: Removed 48 rows containing non-finite values (stat_bin).
Most of the data are clustered around 0.05, with a considerable amount above 0.05. Values greater than 0.10 are sparse and widely spread.
table(Whitewine$chlorides)
##
## 0.009 0.012 0.013 0.014 0.015 0.016 0.017 0.018 0.019 0.02 0.021 0.022
## 1 1 1 4 4 5 5 10 9 16 19 19
## 0.023 0.024 0.025 0.026 0.027 0.028 0.029 0.03 0.031 0.032 0.033 0.034
## 20 34 30 54 58 85 81 108 107 109 119 168
## 0.035 0.036 0.037 0.038 0.039 0.04 0.041 0.042 0.043 0.044 0.045 0.046
## 130 200 160 167 157 182 147 184 141 201 170 181
## 0.047 0.048 0.049 0.05 0.051 0.052 0.053 0.054 0.055 0.056 0.057 0.058
## 171 174 133 170 115 104 130 99 61 88 68 53
## 0.059 0.06 0.061 0.062 0.063 0.064 0.065 0.066 0.067 0.068 0.069 0.07
## 36 46 19 25 23 15 8 18 18 7 18 6
## 0.071 0.072 0.073 0.074 0.075 0.076 0.077 0.078 0.079 0.08 0.081 0.082
## 5 2 5 8 2 9 1 2 4 4 2 2
## 0.083 0.084 0.085 0.086 0.087 0.088 0.089 0.09 0.091 0.092 0.093 0.094
## 5 5 3 4 3 2 1 2 1 3 3 5
## 0.095 0.096 0.097 0.098 0.099 0.102 0.104 0.105 0.108 0.11 0.112 0.114
## 2 6 1 3 1 1 1 1 2 3 1 1
## 0.115 0.117 0.118 0.119 0.12 0.121 0.122 0.123 0.126 0.127 0.13 0.132
## 1 3 1 3 1 2 1 4 3 2 1 1
## 0.133 0.135 0.136 0.137 0.138 0.142 0.144 0.145 0.146 0.147 0.148 0.149
## 1 1 1 2 2 3 1 1 1 2 1 1
## 0.15 0.152 0.154 0.156 0.157 0.158 0.16 0.167 0.168 0.169 0.17 0.171
## 1 2 1 1 4 1 2 2 3 2 2 1
## 0.172 0.173 0.174 0.175 0.176 0.179 0.18 0.184 0.185 0.186 0.194 0.197
## 2 2 2 2 2 1 1 2 2 1 1 2
## 0.2 0.201 0.204 0.208 0.209 0.211 0.212 0.217 0.239 0.24 0.244 0.255
## 1 2 1 2 1 1 1 1 1 1 1 1
## 0.271 0.29 0.301 0.346
## 1 1 1 1
grid.arrange(ggplot(aes(x = free.sulfur.dioxide), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = free.sulfur.dioxide ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Free sulfur dioxide has more outliers than most other features, so trimming should make the analysis clearer. First, let's set the bin width for deeper insight.
ggplot(aes(x = free.sulfur.dioxide), data = Whitewine) +
geom_histogram(binwidth = 1)
Let's check the summary statistics:
summary(Whitewine$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
As with the other variables, there are some extremely large values. If the top 1 percent is omitted:
ggplot(aes(x = free.sulfur.dioxide), data = Whitewine) +
geom_histogram(binwidth = 1, fill = '#5760AB') +
scale_x_continuous( limits = c(0, quantile(Whitewine$free.sulfur.dioxide, 0.99)))
## Warning: Removed 43 rows containing non-finite values (stat_bin).
This time the distribution looks much better and is close to normal. The skewness is also lower than before.
grid.arrange(ggplot(aes(x = total.sulfur.dioxide), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = total.sulfur.dioxide ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Whitewine$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
The data are similar to the previous variables. There are extremely large values and a few outliers, but most of the data follow a bell-shaped, roughly normal distribution. Omitting the top 1 percent gives the distribution below:
ggplot(aes(x = total.sulfur.dioxide), data = Whitewine) +
geom_histogram(binwidth = 1, fill = '#5760AB') +
scale_x_continuous( limits = c(0, quantile(Whitewine$total.sulfur.dioxide, 0.99)))
## Warning: Removed 49 rows containing non-finite values (stat_bin).
Most of the data lie between 50 and 240.
grid.arrange(ggplot(aes(x = density), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = density ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(Whitewine$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The spread of density is very narrow, and there are almost no outliers. Let's use smaller bins:
ggplot(aes(x = density), data = Whitewine) +
geom_histogram(binwidth = 0.0001)
Most of the density values are between 0.98 and 1. Let's omit the top 1 percent:
ggplot(aes(x = density), data = Whitewine) +
geom_histogram(binwidth = 0.0001) +
scale_x_continuous( limits = c(0.9871, quantile(Whitewine$density, 0.99)))
## Warning: Removed 49 rows containing non-finite values (stat_bin).
Most of the density values fall between 0.990 and 0.997.
grid.arrange(ggplot(aes(x = pH), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y = pH ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution is not skewed and looks bell-shaped, but there are some outliers. They occur on both sides, so they do not skew the distribution. With a narrower bin width:
ggplot(aes(x = pH), data = Whitewine) +
geom_histogram(binwidth = 0.01)
summary(Whitewine$pH)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The spread is quite narrow and the skewness is negligible.
grid.arrange(ggplot(aes(x = sulphates), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y =sulphates ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The distribution is slightly right-skewed, with a few outliers. Let's narrow the bin width:
ggplot(aes(x = sulphates), data = Whitewine) +
geom_histogram(binwidth = 0.01)
summary(Whitewine$sulphates)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
After omitting top 1 percentile:
ggplot(aes(x = sulphates), data = Whitewine) +
geom_histogram(binwidth = 0.01) +
scale_x_continuous( limits = c(0.22, quantile(Whitewine$sulphates, 0.99)))
## Warning: Removed 48 rows containing non-finite values (stat_bin).
There is some right skewness, but the distribution is very close to bell-shaped.
grid.arrange(ggplot(aes(x = alcohol), data = Whitewine) +
geom_histogram(),
ggplot(aes(x = 1, y =alcohol ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ), ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(aes(x = alcohol), data = Whitewine) +
geom_histogram(binwidth = 0.1) +
scale_x_continuous(breaks = seq(8,14,1))
summary(Whitewine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
There are still some outliers. The distribution seems to be multimodal, with groups at (8.5-10), (10-11.5), and (11.5-13). The largest group is (8.5-10), and the most common value is around 9.5.
# Bivariate Plots
ggplot(aes(x = factor(quality), y = residual.sugar ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = residual.sugar, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(0, quantile(Whitewine$residual.sugar, 0.99)) +
ylim(3, 9) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 47 rows containing non-finite values (stat_smooth).
## Warning: Removed 61 rows containing missing values (geom_point).
Average quality has a very high variance conditional on residual sugar. For very similar residual sugar values, quality changes a lot, which indicates a very low correlation. However, wines with extreme residual sugar values tend to have lower quality.
When residual.sugar is between 1.5 and 5, quality is at its best, with the highest mean of means.
Between 5 and 10, the variance in quality is very high and the quality mean reaches very high values. However, the mean of means is quite low.
ggplot(aes(x = factor(quality), y = alcohol ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = alcohol, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1)
ggplot(aes(x = alcohol, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
geom_smooth()
## `geom_smooth()` using method = 'gam'
The mean and the mean of means show a pattern in quality conditional on alcohol. If extreme values are trimmed:
ggplot(aes(x = alcohol, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$alcohol, 0.01),
quantile(Whitewine$alcohol, 0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 78 rows containing non-finite values (stat_smooth).
## Warning: Removed 127 rows containing missing values (geom_point).
The trimmed plot shows a clearer positive linear relationship between alcohol levels of 9.5 and 13. The best qualities are found between alcohol levels of 12 and 13.
with(Whitewine, cor.test(alcohol, quality, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: alcohol and quality
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4126015 0.4579941
## sample estimates:
## cor
## 0.4355747
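The numbers in this output are tied together by standard formulas. As a sketch, the t statistic can be recovered from the correlation as t = r * sqrt(df) / sqrt(1 - r^2), and the confidence interval from the Fisher z-transformation with standard error 1 / sqrt(n - 3). The variable names below are illustrative; the inputs are copied from the output above.

```r
r  <- 0.4355747   # sample correlation reported by cor.test()
df <- 4896        # degrees of freedom, n - 2
n  <- df + 2      # number of wines

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

Up to rounding, the results should agree with the t value and the 95 percent interval printed above. With nearly 4,900 wines the interval is narrow, so the moderate positive correlation is estimated quite precisely.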
ggplot(aes(x = factor(quality), y = volatile.acidity ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = volatile.acidity, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
geom_smooth()
## `geom_smooth()` using method = 'gam'
The graph shows a negative relationship between volatile acidity and quality. Let's investigate the extreme points.
ggplot(aes(x = volatile.acidity, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$volatile.acidity, 0.08), quantile(Whitewine$volatile.acidity, 0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 317 rows containing non-finite values (stat_smooth).
## Warning: Removed 388 rows containing missing values (geom_point).
Trimming the extreme high points decreased the slope, but a negative relationship is still clearly visible. Above a volatile acidity of 0.5, the slope (the strength of the relationship) increases.
with(Whitewine, cor.test(volatile.acidity, quality, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and quality
## t = -13.891, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2215214 -0.1676307
## sample estimates:
## cor
## -0.194723
ggplot(aes(x = factor(quality), y = free.sulfur.dioxide ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = free.sulfur.dioxide, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(2, quantile(Whitewine$free.sulfur.dioxide,0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 43 rows containing non-finite values (stat_smooth).
## Warning: Removed 47 rows containing missing values (geom_point).
Quality and free.sulfur.dioxide have a positive relationship between 0 and 30. After 40, the mean of means decreases and falls to a quality level of 6.
ggplot(aes(x = factor(quality), y = total.sulfur.dioxide ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = total.sulfur.dioxide, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$total.sulfur.dioxide,0.01), quantile(Whitewine$total.sulfur.dioxide,0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 99 rows containing missing values (geom_point).
The plot shows that total.sulfur.dioxide has a positive relationship with quality between 0 and 90. The slope becomes negative after 100, but the relationship is weak. Quality is clearly stable between 75 and 150. For small values, quality is very volatile.
summary(Whitewine$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
ggplot(aes(x = factor(quality), y = chlorides ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
Chloride values are mostly concentrated between 0 and 0.1. Let's take a look at the summary table:
summary(Whitewine$chlorides)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The summary shows that even the 3rd quartile is only 0.05. If we trim the extreme values and plot quality conditional on chlorides:
ggplot(aes(x = chlorides, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(0,quantile(Whitewine$chlorides,0.95)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 237 rows containing non-finite values (stat_smooth).
## Warning: Removed 249 rows containing missing values (geom_point).
Quality is very stable between chloride values of 0.025 and 0.075, with a negative relationship with chlorides. The volatility increases after 0.10.
ggplot(aes(x = factor(quality), y = sulphates ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = sulphates, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$sulphates,0.01),
quantile(Whitewine$sulphates,0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 84 rows containing non-finite values (stat_smooth).
## Warning: Removed 96 rows containing missing values (geom_point).
It is difficult to see any relationship between quality and sulphates visually. There is just a small increase around a value of 0.8.
Let's look at how the other variables (citric.acid, fixed.acidity, density, and pH) relate to quality.
ggplot(aes(x = factor(quality), y = citric.acid ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = citric.acid, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$citric.acid,0.01),
quantile(Whitewine$citric.acid,0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 68 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).
ggplot(aes(x = factor(quality), y = fixed.acidity ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = fixed.acidity, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$fixed.acidity,0.01), quantile(Whitewine$fixed.acidity,0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 75 rows containing non-finite values (stat_smooth).
## Warning: Removed 99 rows containing missing values (geom_point).
ggplot(aes(x = factor(quality), y = density ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = density, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$density,0.01),
quantile(Whitewine$density,0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 98 rows containing missing values (geom_point).
with(Whitewine, cor.test(density, quality))
##
## Pearson's product-moment correlation
##
## data: density and quality
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3322718 -0.2815385
## sample estimates:
## cor
## -0.3071233
ggplot(aes(x = factor(quality), y = pH ), data = Whitewine) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.3, color = 'blue' ) +
stat_summary(fun.y = "mean",
geom = "point",
color = "red",
shape = 8,
size = 4)
ggplot(aes(x = pH, y = quality), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
xlim(quantile(Whitewine$pH,0.01),
quantile(Whitewine$pH,0.99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 85 rows containing non-finite values (stat_smooth).
## Warning: Removed 94 rows containing missing values (geom_point).
Quality does not seem to vary with pH or fixed acidity. However, quality does seem to be related to density and citric acid. In particular, denser wines seem to have lower quality on average.
Let's check how the other variables behave conditional on alcohol.
ggplot(aes(x = alcohol, y = residual.sugar), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
ylim(0,quantile(Whitewine$residual.sugar,0.95)) +
coord_trans(y = 'sqrt') +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 240 rows containing non-finite values (stat_smooth).
## Warning: Removed 244 rows containing missing values (geom_point).
Residual.sugar shows a decreasing trend between alcohol levels of 8 and 10.
ggplot(aes(x = alcohol, y = volatile.acidity), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
ylim(quantile(Whitewine$volatile.acidity,0.05), quantile(Whitewine$volatile.acidity,0.95)) +
coord_trans(y = 'sqrt') +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 400 rows containing non-finite values (stat_smooth).
## Warning: Removed 446 rows containing missing values (geom_point).
The plot shows that volatile acidity increases for alcohol values above 11.
with(subset(Whitewine, Whitewine$alcohol>11),
cor.test(volatile.acidity, alcohol, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and alcohol
## t = 14.107, df = 1559, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2917104 0.3797308
## sample estimates:
## cor
## 0.3364553
ggplot(aes(x = alcohol, y = total.sulfur.dioxide), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
ylim(quantile(Whitewine$total.sulfur.dioxide,0.05), quantile(Whitewine$total.sulfur.dioxide,0.95)) +
coord_trans(y = 'sqrt') +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 474 rows containing non-finite values (stat_smooth).
## Warning: Removed 485 rows containing missing values (geom_point).
There is a decreasing trend in total.sulfur.dioxide as alcohol increases.
with(Whitewine, cor.test(alcohol, total.sulfur.dioxide, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: alcohol and total.sulfur.dioxide
## t = -35.15, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4709775 -0.4262443
## sample estimates:
## cor
## -0.4488921
ggplot(aes(x = alcohol, y = density), data = Whitewine) +
geom_jitter(alpha = 0.1, color = 'orange') +
ylim(quantile(Whitewine$density,0.05), quantile(Whitewine$density,0.95)) +
coord_trans(y = 'sqrt') +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 483 rows containing non-finite values (stat_smooth).
## Warning: Removed 491 rows containing missing values (geom_point).
with(Whitewine, cor.test(alcohol, density, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7908646 -0.7689315
## sample estimates:
## cor
## -0.7801376
Density and alcohol have a strong negative relationship (-0.78), as seen in the graph and the correlation test above.
ggplot(aes(x = quality, y = alcohol), data = Whitewine) +
geom_boxplot(aes(group = quality))
High quality wines generally have high alcohol levels, as shown by the box plot.
ggplot(aes(x = quality, y = residual.sugar), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$residual.sugar,0.01), quantile(Whitewine$residual.sugar,0.99))
## Warning: Removed 81 rows containing non-finite values (stat_boxplot).
Low and high quality wines contain similar amounts of sugar, and the data points are quite volatile. It is difficult to trace a clear pattern.
ggplot(aes(x = quality, y = density), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(0.9871,quantile(Whitewine$density,0.99))
## Warning: Removed 49 rows containing non-finite values (stat_boxplot).
High quality wines clearly have low density on average.
ggplot(aes(x = quality, y = volatile.acidity), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$volatile.acidity,0.01), quantile(Whitewine$volatile.acidity,0.99))
## Warning: Removed 82 rows containing non-finite values (stat_boxplot).
It is difficult to detect a relationship between high quality wines and volatile acidity when extremes are trimmed.
ggplot(aes(x = quality, y = chlorides), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$chlorides,0.01),
quantile(Whitewine$chlorides,0.99))
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).
ggplot(aes(x = quality, y = total.sulfur.dioxide), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$total.sulfur.dioxide,0.01), quantile(Whitewine$total.sulfur.dioxide,0.99))
## Warning: Removed 98 rows containing non-finite values (stat_boxplot).
ggplot(aes(x = quality, y = free.sulfur.dioxide), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$free.sulfur.dioxide,0.01), quantile(Whitewine$free.sulfur.dioxide,0.99))
## Warning: Removed 90 rows containing non-finite values (stat_boxplot).
ggplot(aes(x = quality, y = citric.acid), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$citric.acid,0.01),
quantile(Whitewine$citric.acid,0.99))
## Warning: Removed 68 rows containing non-finite values (stat_boxplot).
Free sulfur dioxide is at similar levels across quality levels. Wines of different quality also have similar citric acid amounts. However, wines with the highest quality score (9) have a markedly higher amount of citric acid.
ggplot(aes(x = quality, y = sulphates), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$sulphates,0.01),
quantile(Whitewine$sulphates,0.99))
## Warning: Removed 84 rows containing non-finite values (stat_boxplot).
ggplot(aes(x = quality, y = fixed.acidity), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ylim(quantile(Whitewine$fixed.acidity,0.01), quantile(Whitewine$fixed.acidity,0.99))
## Warning: Removed 75 rows containing non-finite values (stat_boxplot).
Different qualities have quite similar amounts of sulphates and fixed acidity.
Bivariate analysis shows that, other than alcohol, none of the variables has a direct linear relationship with quality. However, some variables are related to quality and to each other. Both high and low quality wines can be observed for different mixtures of inputs.
In the analysis we have seen that alcohol, volatile acidity and residual sugar were the primary features of interest.
Alcohol was found to have a positive correlation (0.43) with quality. The smoothed curve in the graph rises with alcohol. In the box plot by quality, high quality wines were observed to have higher alcohol values.
Volatile acidity was found to have a negative correlation with quality. When smoothed, the relationship is easier to observe. The box plot shows that high quality levels generally have lower volatile acidity.
However, the expected strong relationship between residual sugar and quality was not found. The only observation is that low sugar levels occur in both high and low quality wines, while mid quality wines include high residual sugar levels.
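The correlations mentioned here can be computed for every feature at once. The following is a sketch on made-up toy data; on the real data frame the same call would run over the columns of `Whitewine`:

```r
# Toy data standing in for Whitewine; the values are made up.
toy <- data.frame(
  alcohol          = c(9.0, 9.5, 10.0, 11.0, 12.0, 12.5),
  volatile.acidity = c(0.40, 0.35, 0.30, 0.28, 0.22, 0.20),
  quality          = c(4, 5, 5, 6, 7, 8)
)

# Pearson correlation of every feature with quality, strongest first.
features <- setdiff(names(toy), "quality")
r <- sapply(features, function(v) cor(toy[[v]], toy$quality))
r <- r[order(abs(r), decreasing = TRUE)]
print(round(r, 2))
```

Because quality is an ordinal score, `cor(..., method = "spearman")` may be a reasonable alternative to the default Pearson coefficient.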
Did you observe any interesting relationships between the other features?
Although density was not a primary feature of interest, it was found to have a negative relationship and a notable correlation with quality. The box plot shows that high quality wines are less dense once the extremes are trimmed.
What was the strongest relationship found?
The strongest relationship was found between the alcohol and density variables. ***
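A claim about the strongest pairwise relationship can be checked by searching the correlation matrix. This is a sketch under the assumption that "strongest" means the largest absolute Pearson correlation; the toy values are made up and stand in for columns of `Whitewine`:

```r
# Toy data standing in for Whitewine; the values are made up.
toy <- data.frame(
  alcohol = c(9.0, 9.5, 10.0, 11.0, 12.0, 12.5),
  density = c(0.9990, 0.9975, 0.9960, 0.9940, 0.9915, 0.9900),
  pH      = c(3.10, 3.30, 3.05, 3.25, 3.15, 3.20)
)

m <- cor(toy)
m[upper.tri(m, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal

# Row/column position of the largest absolute correlation.
idx <- which(abs(m) == max(abs(m), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(m)[idx[1, "row"]], colnames(m)[idx[1, "col"]])
print(strongest)
```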
Whitewine$quality_grouped <- cut(Whitewine$quality, c(2,4,7,9))
ggplot(aes(x = alcohol, y = residual.sugar), data = Whitewine) +
geom_point(aes(color = quality_grouped),
stat = 'summary', fun.y = mean) +
scale_color_brewer(type = 'seq',
guide=guide_legend(title = 'quality_grouped'))
When the alcohol versus residual.sugar plot is colored by quality_grouped, the high-quality wines generally have a high alcohol level. The darker blue points show the high quality group.
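To see which scores fall into which group, `cut()` with the breaks used above can be applied to the observed range of quality values. A small sketch:

```r
# Quality scores observed in the data set range from 3 to 9.
quality <- 3:9

# Same breaks as in the analysis; intervals are closed on the right by default.
quality_grouped <- cut(quality, c(2, 4, 7, 9))
print(data.frame(quality, quality_grouped))
```

So scores 3 and 4 form the low group, 5 to 7 the medium group, and 8 and 9 the high group. The `labels` argument of `cut()` could be used to give the groups readable names in the legend.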
ggplot(aes(x = alcohol, y = residual.sugar,
color = factor(quality)), data = Whitewine) +
geom_point(alpha = 0.8, size = 1) +
ylim(quantile(Whitewine$residual.sugar, 0.01),
quantile(Whitewine$residual.sugar, 0.99)) +
geom_smooth(method = "lm", se = FALSE, size=1) +
scale_color_brewer(type = 'seq',
guide=guide_legend(title = 'Quality'))
## Warning: Removed 81 rows containing non-finite values (stat_smooth).
## Warning: Removed 81 rows containing missing values (geom_point).
## Warning: Removed 48 rows containing missing values (geom_smooth).
An important implication of the graph is that high quality wines generally have a low residual.sugar level (less than 5). However, a low sugar level does not by itself imply a high quality wine.
Whitewine$quality_grouped <- cut(Whitewine$quality, c(2,4,7,9))
ggplot(aes(x = alcohol, y = density), data = Whitewine) +
geom_point(aes(color = quality_grouped),
stat = 'summary', fun.y = mean) +
scale_color_brewer(type = 'seq', guide=guide_legend(title = 'quality_grouped'))
ggplot(aes(x = alcohol, y = density,
color = factor(quality)), data = Whitewine) +
geom_point(alpha = 0.8, size = 1) +
ylim(quantile(Whitewine$density, 0.01),
quantile(Whitewine$density, 0.99)) +
geom_smooth(method = "lm", se = FALSE, size=1) +
scale_color_brewer(type = 'seq',
guide=guide_legend(title = 'Quality'))
## Warning: Removed 98 rows containing non-finite values (stat_smooth).
## Warning: Removed 98 rows containing missing values (geom_point).
## Warning: Removed 23 rows containing missing values (geom_smooth).
From the above plot it can be observed that most low quality wines combine low alcohol with high density, and most high quality wines combine low density with a high alcohol level. The medium quality wines are spread all over the graph.
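The per-quality lines drawn by `geom_smooth(method = "lm")` correspond to one linear fit per quality level. A minimal sketch of extracting those slopes, using made-up toy data in place of `Whitewine`:

```r
# Toy data standing in for Whitewine; the values are made up.
toy <- data.frame(
  alcohol = c(9, 10, 11, 12, 9, 10, 11, 12),
  density = c(0.998, 0.996, 0.994, 0.992, 0.999, 0.996, 0.993, 0.990),
  quality = c(5, 5, 5, 5, 7, 7, 7, 7)
)

# Fit density ~ alcohol separately within each quality level
# and keep the slope of each fit.
slopes <- sapply(split(toy, toy$quality), function(d) {
  coef(lm(density ~ alcohol, data = d))[["alcohol"]]
})
print(slopes)
```

A negative slope in every group would indicate that the alcohol and density relationship holds within quality levels and not only across them.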
ggplot(aes(x = alcohol, y = citric.acid), data = Whitewine) +
ylim(quantile(Whitewine$citric.acid,0.01),
quantile(Whitewine$citric.acid, 0.99)) +
geom_point(aes(color = quality_grouped),
stat = 'summary', fun.y = mean) +
scale_color_brewer(type = 'seq',
guide=guide_legend(title = 'quality_grouped'))
## Warning: Removed 68 rows containing non-finite values (stat_summary).
ggplot(aes(x = alcohol, y = citric.acid,
color = factor(quality)), data = Whitewine) +
geom_point(alpha = 0.8, size = 1) +
ylim(quantile(Whitewine$citric.acid, 0.01),
quantile(Whitewine$citric.acid, 0.99)) +
geom_smooth(method = "lm", se = FALSE, size=1) +
scale_color_brewer(type = 'seq',
guide=guide_legend(title = 'Quality'))
## Warning: Removed 68 rows containing non-finite values (stat_smooth).
## Warning: Removed 68 rows containing missing values (geom_point).
summary(Whitewine$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
summary(Whitewine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
High quality wines clearly tend to have citric acid levels above the median. From the summary, we can see that high quality wines are distributed between the median and the 3rd quartile of citric.acid. Almost none of the high quality wines have low citric acid values.
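A statement like this can be quantified as the share of high quality wines above the overall median. The sketch below assumes "high quality" means a score of 8 or 9, matching the `(7,9]` group, and uses made-up toy data:

```r
# Toy data standing in for Whitewine; the values are made up.
toy <- data.frame(
  quality     = c(4, 5, 5, 6, 6, 7, 8, 8, 9, 9),
  citric.acid = c(0.20, 0.25, 0.30, 0.28, 0.35, 0.31, 0.34, 0.40, 0.36, 0.49)
)

overall_median <- median(toy$citric.acid)
high <- toy$quality >= 8   # assumption: matches the (7,9] group

# Fraction of high quality wines with citric acid above the overall median.
share_above <- mean(toy$citric.acid[high] > overall_median)
print(share_above)
```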
Whitewine$alcohol_grouped <- cut(Whitewine$alcohol,
c(7.50,9,10.50,12,16))
ggplot(aes(x = factor(quality), y = volatile.acidity),
data = Whitewine) +
geom_boxplot(aes(fill = alcohol_grouped)) +
scale_fill_brewer(type = 'seq',
guide=guide_legend(title = 'alcohol_grouped'))
The plot shows that high quality wines have a high alcohol level and low volatile acidity. An important inference is that the difference between alcohol groups becomes more visible as volatile acidity increases.
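The grouped box plot can be summarized as a two-way table of medians, with quality in the rows and alcohol group in the columns. A sketch on made-up toy data, using the same alcohol breaks as above:

```r
# Toy data standing in for Whitewine; the values are made up.
toy <- data.frame(
  quality          = c(5, 5, 5, 5, 8, 8, 8, 8),
  alcohol          = c(8.5, 9.8, 11.0, 12.5, 8.8, 10.0, 11.5, 13.0),
  volatile.acidity = c(0.40, 0.36, 0.30, 0.28, 0.30, 0.27, 0.24, 0.20)
)
toy$alcohol_grouped <- cut(toy$alcohol, c(7.50, 9, 10.50, 12, 16))

# Median volatile acidity for every quality x alcohol group cell.
med <- tapply(toy$volatile.acidity,
              list(quality = toy$quality, alcohol = toy$alcohol_grouped),
              median)
print(med)
```

Cells with no wines come out as `NA`, which makes sparse combinations easy to spot.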
ggplot(aes(x = alcohol, y = chlorides), data = Whitewine) +
ylim(quantile(Whitewine$chlorides, 0.01),
quantile(Whitewine$chlorides, 0.99)) +
geom_boxplot(aes(fill = alcohol_grouped)) +
scale_fill_brewer(type = 'seq',
guide=guide_legend(title = 'alcohol_grouped'))
## Warning: Removed 88 rows containing non-finite values (stat_boxplot).
There is a negative relationship between quality and chlorides. Low chlorides combined with a high alcohol level is a good indicator of quality.
ggplot(aes(x = factor(quality),
y = total.sulfur.dioxide), data = Whitewine) +
geom_boxplot(aes(fill = alcohol_grouped)) +
scale_fill_brewer(type = 'seq',
guide=guide_legend(title = 'alcohol_grouped'))
High quality wines are generally concentrated below a total.sulfur.dioxide value of 150.
In the multivariate analysis, quality and alcohol values were grouped and converted to factors to get better visualizations.
High quality wines had a high alcohol level, low residual sugar, low density, high citric acid, low chlorides, low sulfur dioxide and low volatile acidity.
The third dimension improved the visualizations and made patterns easier to detect. Since the relationships were non-linear, grouping the alcohol and quality values provided further insights.
The medium quality wines are dispersed all over the graphs. Therefore, further analysis is needed to characterize them in more detail.
Were there any interesting interactions between features?
Alcohol and residual sugar interactions are surprising. ***
ggplot(aes(x = quality, y = alcohol), data = Whitewine) +
geom_boxplot(aes(group = quality)) +
ggtitle('Alcohol Wine Quality Box Plot') +
labs(y = "alcohol (% by volume)",
x = "quality(score between 0 and 10)")
Alcohol and quality have a positive relationship above a quality of 5, and the relationship is very close to linear. Although there are fewer data points, the relationship is negative between quality values 3 and 5. However, when extremes are trimmed as in the first graph, it is easier to observe the trend.
ggplot(aes(x = factor(quality),
y = volatile.acidity) , data = Whitewine) +
geom_boxplot(aes(fill =alcohol_grouped)) +
scale_fill_brewer(type = 'seq',
guide=guide_legend(title = 'alcohol_grouped')) +
ggtitle('Wine Quality Volatile Acidity by Alcohol Graph') +
labs(x = "quality (score between 0 and 10)",
y = "volatile acidity (acetic acid - g / dm^3)")
The graph shows that there is a negative relationship between volatile acidity and quality. We can also observe that high quality wines tend to have a high alcohol level. Furthermore, the separation between alcohol groups increases at high volatile acidity.
ggplot(aes(x = alcohol, y = residual.sugar), data = Whitewine) +
geom_point(alpha = 0.1, position = position_jitter(h=0), color = 'orange') +
ylim(0,quantile(Whitewine$residual.sugar, 0.95)) +
coord_trans(y = 'sqrt') +
geom_smooth() +
ggtitle('Alcohol Residual Sugar Graph') +
labs(x = "Alcohol(% by volume)",
y = "Residual Sugar(g / dm^3)")
## `geom_smooth()` using method = 'gam'
## Warning: Removed 240 rows containing non-finite values (stat_smooth).
## Warning: Removed 240 rows containing missing values (geom_point).
The negative relationship between alcohol and residual sugar is visible. Although the variance is quite high, the smoothing curve shows the average residual sugar by alcohol. It is interesting to see that residual sugar decreases significantly as alcohol increases. ***
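A smoothing curve of this kind is a local average of residual sugar along the alcohol axis. The plot above uses the default smoother chosen by `geom_smooth()`; the sketch below swaps in base R's `lowess()` as a simpler stand-in and runs on made-up toy data, so it illustrates the idea and not the exact curve:

```r
# Toy data standing in for Whitewine: sugar falls as alcohol rises,
# with a deterministic wobble in place of real scatter. Values are made up.
alcohol <- seq(8, 14, length.out = 60)
residual.sugar <- pmax(0.6, 14 - alcohol + sin(seq_along(alcohol)))

# Locally weighted smoother; f is the fraction of points used per local fit.
fit <- lowess(alcohol, residual.sugar, f = 2/3)

# Smoothed residual sugar at the lowest and highest alcohol values.
print(c(first = fit$y[1], last = fit$y[length(fit$y)]))
```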
This analysis was conducted to explore the features of white wines and the relationships among them. The main purpose was to identify the feature values that characterize high quality and low quality wines. The analysis helped to extract the main features of high quality and low quality wines. Middle quality wines generally do not have extreme values, although they are dispersed all over the graphs.
Box plots of features by quality value helped to detect small differences among groups, which was nearly impossible to do from point and line graphs.
Some features, such as citric acid, chlorides, residual sugar and density, take different values among wines with the same high or low quality score. Still, common properties of high quality wines can be extracted from this analysis, and any company could make use of them.
Some limitations made the interpretation difficult. The 10 point scale may be one limitation: the wines are compressed between quality 3 and 9.
There were no wines rated 10, and none rated 1 or 2, so most of the wines are of middle quality. This may be due to ceiling and floor effects in the quality ratings of the wines. ***
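The compression of the scale can be made explicit by tabulating the scores against the full 0 to 10 range. The sketch below uses a made-up vector of scores in place of `Whitewine$quality`, and treats scores 5 to 7 as "middle quality" as an assumption:

```r
# Toy scores standing in for Whitewine$quality; the values are made up.
quality <- c(3, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 9)

# Count every score on the full 0-10 scale, including unused ones.
counts <- table(factor(quality, levels = 0:10))
unused <- names(counts)[counts == 0]

# Share of wines in the middle of the scale (assumed here to be 5 to 7).
share_middle <- mean(quality >= 5 & quality <= 7)

print(counts)
print(unused)
print(round(share_middle, 2))
```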